Gentrification is a major issue not just in NYC today, but also in other major cities throughout the world. Gentrification is a phenomenon whereby deteriorating neighborhoods are renewed and rebuilt, accompanied by an influx of middle-class or affluent residents that often displaces poorer residents. As wealthier residents move in, real-estate prices gradually increase, more businesses enter to serve them, low-income families begin to leave, and the complexion of the neighbourhood changes. Williamsburg in Brooklyn, NYC is a prime contemporary example of gentrification. Over the last decade, housing prices have skyrocketed and it is now considered both a trendy and expensive location for the upwardly mobile citizen.
Similar transformations have been witnessed in cities in other countries, including Vancouver, Canada; London, U.K.; Paris, France; and Cape Town, South Africa. In the US itself, gentrification has occurred in South Boston, the Central District of Seattle, the Mission District of San Francisco and the H Street Corridor in Washington, to name a few. Gentrification is a very controversial topic in urban planning. Some have a negative view of it since it changes the heritage and culture of a community and causes the exit of long-time residents. Others believe gentrification is a dynamic phenomenon, and is not necessarily a bad thing. Indeed, one view holds that gentrification may be a reversion to the original status, as many neighbourhoods that are gentrified today used to be havens of the rich in the past.
Gentrification has significant consequences. This phenomenon is accompanied by changes in transportation, affordable housing, education and health care resources, as well as the nature of businesses and real estate investment in the community. The culture of the community and its political representation also change as a result. For these reasons, gentrification is a much talked about issue, and there is much interest in understanding the causes of this phenomenon and ways in which it can be predicted.
Given the reasons just mentioned, we decided to focus our analysis of 311 NYC complaint data on this issue of gentrification. In NYC, gentrification has been or is being witnessed in Williamsburg, Fort Greene, Harlem and Bedford-Stuyvesant, among many other neighbourhoods. As residents of NYC, we witness first hand the changing nature of neighbourhoods around us. As a group, we found this really fascinating and this is why we chose gentrification as our topic. Our goal was to predict gentrification in New York City by making use of the 311 NYC complaint data.
Note: In this project, our goal is to identify the extent of gentrification in a given neighbourhood of NYC. Our strategy is to identify robust proxies of gentrification and use this information to make our prediction. We are not asserting that there is a causal relationship between our chosen indicators and gentrification, but rather, a correlation. This is an important point to make note of at the outset.
Broadly speaking, the following list of steps sums up our approach towards this project.
Identify robust predictors of gentrification based on research and intuition. Then identify data sources and clean and merge data sets, making them usable.
Verify choice of features through exploratory analysis. Revise model strategy accordingly.
Model gentrification using selected features.
Run model on test set.
Use unsupervised learning methods.
Revise model if required.
We now explain what each of these steps entailed.
This project required us to work with 311 NYC data for the years 2010-2016. However, we decided to use more data, since we wanted to identify gentrification more accurately. This required having data across a longer period of time, which would facilitate the identification of a clear trend. Therefore, we decided to work with 311 data from 2004-2016. This meant that we were working with almost 12 GB of data.
Before commencing with the cleaning of the data, it was important to have a broad sense of the features we would be working with. These features are variables we considered to be highly correlated with gentrification. Having a sense of this allowed us to identify which data sets we would be working with and what tasks needed to be performed during the cleaning phase.
Initially, we identified the following six variables as strong proxies of gentrification:
Real Estate Prices: Our hypothesis was that a large annual increase in the average real estate price in a neighbourhood is a signal of increased gentrification.
Income levels: Our hypothesis was that a large annual increase in average income levels in a neighbourhood is a signal of increased gentrification.
NYC 311 Complaints: We believe that particular types of complaints and their attributes (frequency, timing, etc.) are strongly associated with high or low gentrification; we explain more on this later.
Crime level: Our hypothesis was that high crime levels in a neighbourhood are a sign of low gentrification, and vice versa.
Business/Occupation: We believe that small businesses and mom-and-pop shops are associated with low gentrification, and vice versa.
Unemployment level: Our hypothesis was that a high unemployment level is a robust proxy for low gentrification in that neighborhood, and vice versa.
Once we identified features, the next step was to gather the data. We used the following sources to acquire this data:
| Feature | Data Source |
|---|---|
| Income | IRS |
| Housing Price | Zillow NYC |
| 311 Complaint Types | NYC 311 Complaint Data |
| Crime Data | National Incident-Based Reporting System |
| Business/Occupation | Zillow NYC |
| Unemployment Rate | Bureau of Labor Statistics |
Making this data usable for inference was a greater challenge than we had anticipated. For example, much to our chagrin, we discovered that for the years 2004-2010, zip code data was not consistent. Also, class intervals for income levels were not the same across the years. Therefore, we had to work with each year’s data set separately and standardize these aspects before finally merging the data sets. This required a lot of effort and time. We also appended 311 complaint data from 2004-2009 to the data set which was provided.
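A minimal sketch of this kind of standardization is given here for illustration. It is not the project's actual cleaning code: the field names (`zip`, `year`, `income`, `price_sqft`), the helper names and the sample values are all hypothetical, and it covers only zip code clean-up and merging, not the harmonization of income class intervals.

```python
import re

def normalize_zip(raw):
    """Return a 5-digit zip string, or None when the value is unusable."""
    # Drop a ZIP+4 suffix ("10025-1234") or a float tail ("10025.0").
    head = re.split(r"[-.]", str(raw).strip())[0]
    if not head.isdigit() or len(head) > 5:
        return None
    return head.zfill(5)  # restore leading zeros lost in numeric columns

def merge_by_zip_year(*tables):
    """Combine rows from several tables into one record per (zip, year)."""
    merged = {}
    for table in tables:
        for row in table:
            zip_code = normalize_zip(row["zip"])
            if zip_code is None:
                continue  # skip rows whose zip cannot be recovered
            key = (zip_code, int(row["year"]))
            extra = {k: v for k, v in row.items() if k not in ("zip", "year")}
            merged.setdefault(key, {}).update(extra)
    return merged

# Illustrative rows only, not project data.
income = [{"zip": "10025.0", "year": 2004, "income": 50000}]
housing = [{"zip": 10025, "year": "2004", "price_sqft": 700}]
merged = merge_by_zip_year(income, housing)
```

Keying every table on the same cleaned `(zip, year)` pair is what allows data sets with differently formatted zip columns to be joined.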
Unfortunately, we had to drop crime, unemployment and business as features since the data sets we found were not conducive to our analysis. For example, these data sets didn’t have zip code information. The location info was provided as latitude-longitude pairs, which made it impractical for us to merge them with the other data sets because Google has a limit of 2,500 conversions from latitude-longitude to zip code. The crime data alone had over 100,000 latitude-longitude points. We could have run this analysis, but we chose to limit our efforts based upon the time and financial constraints.
Selecting 311 Complaint Types:
We decided to work with complaints related to the following 7 topics: a) Mold b) Dirty conditions c) Drinking water d) Homeless encampment e) Water quality f) Indoor air quality g) Urinating in public
There are two main reasons for this. First, we believe that these complaints are associated with the level of or change in gentrification in a neighbourhood. Secondly, we found data for these variables from 2004-2010, which we required for our analyses.
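As an illustration of how the selected complaint types could be tallied, the sketch below counts matching records per zip code, year and type. The record keys (`zip`, `year`, `complaint_type`) and the exact spelling of the type labels are assumptions made for this example and may not match the raw 311 export.

```python
from collections import Counter

# Assumed labels for the seven selected complaint types (lower-cased).
SELECTED_TYPES = {
    "mold", "dirty conditions", "drinking water", "homeless encampment",
    "water quality", "indoor air quality", "urinating in public",
}

def count_selected(rows):
    """Count selected complaints per (zip, year, complaint type)."""
    counts = Counter()
    for row in rows:
        kind = row["complaint_type"].strip().lower()
        if kind in SELECTED_TYPES:
            counts[(row["zip"], row["year"], kind)] += 1
    return counts

# Illustrative records only, not project data.
rows = [
    {"zip": "11211", "year": 2008, "complaint_type": "Dirty Conditions"},
    {"zip": "11211", "year": 2008, "complaint_type": "dirty conditions "},
    {"zip": "11211", "year": 2008, "complaint_type": "Noise"},
]
counts = count_selected(rows)
```

Matching on a trimmed, lower-cased label guards against small inconsistencies in how the same complaint type was recorded in different years.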
We believe that housing prices are the most prominent signal for gentrification. In this section, we focus on the housing costs and income of different zip codes, separated by borough. Zillow is a website that publicizes the listing and sale prices of homes across the US. Due to limitations in Zillow data, we were not able to acquire rental data before 2010. Thus, our housing prices are all based on sales of homes only. Given more time to search for other data sources, we would like to incorporate rental prices into our future models.
First, let us explore the differences in income of the four main boroughs in New York: Bronx, Brooklyn, Manhattan, and Queens.
Manhattan clearly has the highest variation in income. We could have taken logs of the income to make it more evenly distributed, but that would be difficult to interpret. Instead, we wondered what would happen if we removed all of the already-wealthy neighborhoods. There is a distinct cutoff in overall income between high income neighborhoods like Chelsea, Soho, and the Upper East/West Side and low income neighborhoods like Harlem, Inwood, and the Lower East Side. The former are areas that have no more potential for further gentrification, whereas the latter, we expect, are seeing rising housing prices and income. The remaining neighborhoods are seen below.
Now that we have isolated the four Manhattan neighborhoods with the lowest income levels, we explore their housing costs. First, we plot the Median Sale Price per square foot for each neighborhood. This is a metric that gives us a baseline of how expensive housing is per unit, eliminating differences due to number of bedrooms or overall size.
East Harlem and the Lower East Side have zip codes with price per square foot above the $1000 mark. Central Harlem and Inwood/Washington Heights are not trailing too far behind. Central Harlem in particular seems to be seeing an upward trend in the last few years.
Another metric we have is the Percent of Homes Increasing in Value, calculated between the previous year and current year of measurement. This allows us to see where housing prices may be increasing the most, or where there is increasing demand from buyers who think it is a good investment. Increasing home values are a potential sign of gentrification, as more demand for housing means the area is becoming more desirable for home ownership, renovations, and subsequent increases in rents.
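To make the metric concrete, here is a small sketch of how a "percent of homes increasing in value" figure could be computed from per-home valuations in two consecutive years. This is an assumed reading of the metric for illustration; Zillow's own methodology may differ, and the home identifiers and values are made up.

```python
def percent_increasing(prev_values, curr_values):
    """Percent of homes whose value rose between two years.

    Both arguments map a home identifier to its estimated value; only
    homes present in both years are compared.
    """
    shared = prev_values.keys() & curr_values.keys()
    if not shared:
        return None  # nothing to compare
    rising = sum(1 for home in shared if curr_values[home] > prev_values[home])
    return 100.0 * rising / len(shared)

# Illustrative valuations only, not project data.
prev = {"a": 500_000, "b": 400_000, "c": 650_000, "d": 300_000}
curr = {"a": 520_000, "b": 390_000, "c": 700_000, "d": 310_000}
share = percent_increasing(prev, curr)  # 3 of the 4 homes rose in value
```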
Finally, the turnover rate of houses (the percentage of homes in the area that were sold that year) may give us further insight into gentrification and the overall health of the housing market. Unfortunately, there are few distinct trends in these lines, aside from a slightly increasing turnover rate in East Harlem. The huge 50% turnover in zip code 10037 in 2008-9 is likely due to a large-scale acquisition or takeover between management companies, perhaps a sign of the risk and turbulence after the financial crisis.
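The turnover definition above, and a simple way to flag unusual years such as the spike in zip code 10037, can be sketched as follows. The spike rule (a rate more than three times the median year) and the sample rates are assumptions chosen for illustration, not values from our data.

```python
from statistics import median

def turnover_rate(homes_sold, total_homes):
    """Percent of homes in an area that were sold during the year."""
    if total_homes <= 0:
        raise ValueError("total_homes must be positive")
    return 100.0 * homes_sold / total_homes

def flag_spikes(rate_by_year, factor=3.0):
    """Return the years whose rate exceeds `factor` times the median rate."""
    typical = median(rate_by_year.values())
    return sorted(y for y, r in rate_by_year.items() if r > factor * typical)

# Illustrative turnover rates (percent) for one zip code.
rates = {2006: 14.0, 2007: 15.0, 2008: 50.0, 2009: 16.0, 2010: 15.0}
spikes = flag_spikes(rates)
```

Flagged years are candidates for manual inspection rather than automatic removal, since a spike may reflect a bulk transaction and not ordinary market activity.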
Now we turn to Brooklyn, where it is common knowledge that gentrification is taking place in several notoriously hip, young neighborhoods.
Again, we first look at income levels across different neighborhoods as a baseline estimate of wealth in the different areas. The income in many of these areas is well below $50,000 annually. In some areas, the income is increasing faster than in others. We will see later in housing prices that there is not necessarily a correlation between income and housing price - particularly in Bushwick and Williamsburg.
For Brooklyn’s housing data, we see significant variations in price per square foot, and we can easily use visualizations to separate out the more expensive neighborhoods. Bushwick/Williamsburg and Northwest Brooklyn (around DUMBO) lead with prices around $750, and Greenpoint and Central Brooklyn (around Prospect Park) trail closely behind. These neighborhoods all show signs of fast rates of increase in housing prices. The zip code of Williamsburg, seen as the orange line, especially sees an uptick of prices in the last 7 years. Other neighborhoods fluctuate less and are more stagnant in overall trends.
The plot of Percent of Homes Increasing in Value in Brooklyn, seen below, seems much more affected by the 2008 financial crisis than Manhattan above. We also observe several zip codes that have almost no change in this percentage, possibly due again to management company reporting if they own a large area of homes. What is most interesting here is that there seems to be little relation between this plot and the previous one with final home sale prices per square foot. The most intuitive explanation that we hypothesize is that for homes with increasing values, people are holding onto them rather than choosing to sell.
This idea that housing is being artificially held onto during a period of increasing home value is further supported by the turnover rate below. We see here that the more highly valued and increasingly priced neighborhoods do not necessarily have high turnover rates. In comparison to Manhattan, the turnover rates are at least a solid 5 percentage points lower (centered around 10% rather than 15%). This is only a hypothesis though, and future tests and additional data would be needed to look further back historically, to when there was less gentrification in Manhattan, and compare those trends to Brooklyn in the more recent decade.
All of the housing price analysis is a good start and has provided important insights into the housing markets of Manhattan and Brooklyn, and may potentially help predict gentrification in these neighborhoods and zip codes.
However, the greatest weakness of this analysis is our lack of rent data. In most of these gentrifying areas, the first population to move in is young professionals or artists who rent rather than buy homes. Additionally, the factor that drives out poorer demographics from these areas is the resulting rise in rents. We wish to continue our investigation after incorporating more sources of data, which time did not allow for in this report.
The Sankey plot shows the aggregated view of our selected 311 complaint types for all New York City boroughs. While this plot does not show change over time - a feature we would like to add later - it still offers an interesting comparison between the boroughs. In our housing price and income analysis, we focused on Manhattan and Brooklyn because those were the boroughs with the most interesting changes and trends. There seemed to be less gentrification in the other boroughs like Queens and Bronx, or at least they did not represent the varying stages of early to late gentrification, and thus we left them out of the final report.
In this Sankey plot, the most frequent complaint type was Dirty Conditions, which was reported by each borough. Water Quality was also reported by all, in proportion to the total borough complaint count. These two complaints are covered by the DSNY and DEP, the two New York agencies that cover sanitation and environmental protection.
The DOHMH and NYPD handle home conditions and public behavior. The complaints that these agencies handled are relatively more frequent in Brooklyn and Manhattan. The NYPD complaints, covering homeless encampment and public decency, are especially high in Manhattan relative to other complaints. These differences in complaint distributions across different boroughs may speak to their stages of gentrification. In Manhattan, the wealthiest borough, there may be more homeless people as they can perhaps get more money or find more food. It is also possible that rich people tend to complain more about homeless encampments or drunk people getting in their way or making their neighborhood feel less clean and safe.
Indoor air quality and mold complaints may be a signal of an earlier stage of gentrification, as they are frequent in both Manhattan and Brooklyn. In older, unrenovated apartments, mold and air quality tend to be worse due to old materials, cracking paint, and other forms of dilapidation. While similarly worn-down apartments may exist in Queens and the Bronx, the higher rates of complaints in Brooklyn and Manhattan may be due to people requiring a higher standard of living. It is plausible that if someone sees their neighbors moving into a newly renovated apartment with fresh paint and flooring, they would want to have the same. Air quality reduction also could be caused by debris and dust from construction, which may be more frequent in areas with more renovations and reconstruction. More renovation and construction is another signal of gentrification.
We use the 7 types of 311 complaint data, namely a) Mold b) Dirty conditions c) Drinking water d) Homeless encampment e) Water quality f) Indoor air quality g) Urinating in public, to plot the complaint map of New York City. The color bar represents the total number of complaints of these 7 types. A darker area means a larger number of 311 complaints. A black area means the number of complaints is missing in the dataset. This visualization reaffirmed that there may be some connection between 311 complaint data and gentrification, as the areas surrounding Williamsburg see an obvious decline in health complaints over the 10-year period. The challenge was that we lumped all 7 complaints together, so we couldn’t see if some were rising while others were falling. As a result, we decided to separate the complaint types, as shown below.
From the following plots, we can see the trend of complaint numbers in five boroughs of New York City in different years.
The most frequent complaints concern dirty conditions; they were concentrated around the time of the financial crisis and came mostly from Queens and Brooklyn. Another type related to the financial crisis is indoor air quality, with complaints mostly found in Manhattan. In contrast to these two types, homeless encampment complaints reached their lowest point in 2010, and were again most frequent in Manhattan. For water quality complaints, Queens and Brooklyn again led in complaint numbers, especially in 2007. Drinking water complaints show the same trend and distribution as water quality complaints, but also occurred in Manhattan. Urinating in public complaints occurred mostly in Manhattan, without any obvious change during these years. Mold complaints dropped drastically after 2005 and have stayed low since.
As part of the aim of this project, we wanted to identify the indicators of gentrification and create a score based on the data we had. We found that the change in the per capita income of the area as well as the change in housing prices were important for determining how gentrified the area was. The 311 data, however, was not as clear, and we required a method to determine how the different fields in the data were related to gentrification. As we had many zip codes and 7 complaint types per zip code, we decided to use a subset of zip codes that had the most change in housing prices and income. We determined the average change in income and housing across all of NYC and used it as a threshold, eliminating the zip codes closest to the mean from our analysis. These zip codes were eliminated because, at its core, gentrification is a change in income and housing that is outside the norm.
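The selection step can be illustrated with a short sketch that keeps only zip codes whose change lies far from the citywide mean. How close to the mean counts as "closest" is not specified above, so the band of half a standard deviation used here is an assumption, as are the function name and the sample values.

```python
from statistics import mean, pstdev

def select_outlying_zips(change_by_zip, band=0.5):
    """Keep zip codes whose change is far from the citywide mean.

    `band` is the half-width of the excluded region around the mean,
    measured in standard deviations (an assumed cutoff).
    """
    values = list(change_by_zip.values())
    mu, sigma = mean(values), pstdev(values)
    return {z: c for z, c in change_by_zip.items() if abs(c - mu) > band * sigma}

# Illustrative combined income and housing changes, not project data.
changes = {"a": 0.0, "b": 0.1, "c": 0.11, "d": 0.5}
kept = select_outlying_zips(changes)  # drops "b" and "c", which sit near the mean
```

Zip codes well below the mean are kept as well as those well above it, so the subset contains both strongly and weakly changing areas.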
In order to formulate a gentrification score, we found the correlation between different predictors. Using the rate of change of income and the rate of change of housing as an indication of gentrification, we compared the rate of change of the different 311 data fields to it. Some showed a direct correlation, some an inverse correlation and some no correlation. The water and dirty values showed a direct correlation whereas the drinking values showed an inverse correlation. The rest of the fields showed no correlation. The graphs are given below.
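A minimal sketch of this comparison is shown here, assuming that "rate of change" means the year-over-year relative change and that the comparison uses the Pearson correlation coefficient. Both are assumptions for illustration, and the two series are made up.

```python
import math

def rate_of_change(series):
    """Year-over-year relative change; assumes no value in the series is zero."""
    return [(curr - prev) / prev for prev, curr in zip(series, series[1:])]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / (sx * sy)

# Illustrative yearly values for one zip code, not project data.
income_housing = [100, 105, 115, 130, 140]
dirty_complaints = [40, 44, 52, 65, 72]
r = pearson(rate_of_change(income_housing), rate_of_change(dirty_complaints))
```

A value of `r` near 1 corresponds to a direct correlation, a value near -1 to an inverse correlation, and a value near 0 to no linear relationship.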
Correlation Analysis: The above graphs show the correlation between different 311 data types and the income+housing changes. All of the above graphs peak around 2007-2009. They have a similar shape, increasing in the initial years and then descending towards the end. The 311 data types that correlate with the change in income+housing are the water data, dirty conditions data and the inverse of the drinking data.
The gentrification score is a linear combination of the changes in income, housing, water values, dirty values and inverse change in drinking values. We found that using equal weights gives a good score. You can find the gentrification score that we calculated for each zip code below:
Note: We used Python to generate the gentrification score. Please see the Python code in the folder if you are interested.
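The form of such a score can be sketched as below. This is an illustration under assumptions: "inverse change" is read as a sign flip, the common weight is left as a parameter because its value and any rescaling are not stated above, and the field names and sample changes are hypothetical.

```python
# +1 for indicators that move with gentrification, -1 for the inverse one.
SIGNS = {"income": 1, "housing": 1, "water": 1, "dirty": 1, "drinking": -1}

def gentrification_score(changes, weight=1.0):
    """Equal-weight linear combination of the five change indicators."""
    return sum(weight * sign * changes[name] for name, sign in SIGNS.items())

# Illustrative rates of change for one zip code and year, not project data.
example = {"income": 0.1, "housing": 0.2, "water": 0.05, "dirty": 0.05, "drinking": 0.1}
score = gentrification_score(example)
```

With equal weights, each indicator contributes in proportion to the size of its change, so indicators measured on very different scales would need to be normalized first.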
Gentrification Score for each zip code
Now we use the gentrification score above as data for our prediction model. We use a local polynomial regression model to predict the future gentrification score for each zip code.
The local regression model (LOESS model) is used for fitting a smooth curve between two variables, or fitting a smooth surface between an outcome and up to four predictor variables. The procedure originated as Locally Weighted Scatter-plot Smoother. Since then it has been extended as a modelling tool because it has some useful statistical properties. At each point in the range of the data-set a low-degree polynomial is fitted to a subset of the data, with explanatory variable values near the point whose response is being estimated. The polynomial is fitted using weighted least squares, giving more weight to points near the point whose response is being estimated and less weight to points further away. The value of the regression function for the point is then obtained by evaluating the local polynomial using the explanatory variable values for that data point. The LOESS fit is complete after regression function values have been computed for each of the data points.
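The procedure just described can be sketched for a single predictor in plain Python. This is a simplified illustration, not the library routine used for our results: it uses tricube weights and a nearest-neighbour window, both common choices but assumptions here, it omits robustness iterations, and the sample scores are made up.

```python
import math

def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def loess_predict(xs, ys, x0, span=0.8, degree=2):
    """Estimate the response at x0 with a locally weighted polynomial fit."""
    n = len(xs)
    # Size of the local window: a fraction `span` of the data points.
    k = min(n, max(degree + 2, math.ceil(span * n)))
    dists = [abs(x - x0) for x in xs]
    h = sorted(dists)[k - 1]  # distance to the k-th nearest point
    # Tricube weights: near points count most, points at distance >= h get 0.
    w = [(1 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dists]
    # Weighted least squares on powers of (x - x0), via the normal equations.
    p = degree + 1
    a = [[sum(wi * (x - x0) ** (i + j) for wi, x in zip(w, xs))
          for j in range(p)] for i in range(p)]
    b = [sum(wi * (x - x0) ** i * y for wi, x, y in zip(w, xs, ys))
         for i in range(p)]
    # Because x is centred on x0, the fitted value at x0 is the intercept.
    return solve(a, b)[0]

# Illustrative yearly scores for one zip code, not project data.
years = list(range(2004, 2014))
scores = [1.00, 1.05, 1.12, 1.30, 1.42, 1.38, 1.45, 1.52, 1.60, 1.66]
fitted_2009 = loess_predict(years, scores, 2009)
```

Centring the polynomial on `x0` keeps the normal equations well conditioned when the predictor is a calendar year. Evaluating the function at a year outside the training range extrapolates the nearest local fit; library implementations may handle that case differently.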
Here we use the gentrification scores from 2004 to 2013, ten years of data, to train the LOESS model. The degree and span parameters of the LOESS model are obtained by cross-validation. Here we use degree = 2 and span = 0.8 in our LOESS model. After building the model, we predict the gentrification scores for 2014 and 2015. To validate our model, we select zip codes 10025 (Morningside Heights) and 11211 (Williamsburg).
Original score for 2004-2015
Prediction score for 2014 & 2015
The predicted gentrification scores are reasonable and capture the general trend of the gentrification score in different areas. The overall error rate is around 30% when comparing the predicted gentrification score with the actual gentrification score. For example, zip code 10025, which is around Columbia University, has predicted gentrification scores of 1.610619 and 1.558206 for 2014 and 2015, while the actual scores are 1.654007 and 1.798823; the error rate is below 20%. For a high-gentrification area like the Williamsburg neighborhood (zip code 11211), the actual gentrification scores in 2014 and 2015 are 1.836679 and 1.982375, while the predicted scores are 1.860064 and 1.934579. The error rate is below 5%, which is a very close fit.
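The per-zip error rates can be checked directly from the scores quoted above. The sketch assumes that "error rate" means the absolute difference between predicted and actual score divided by the actual score; that definition is an assumption, since it is not spelled out in the text.

```python
def relative_error(predicted, actual):
    """Absolute prediction error as a fraction of the actual value."""
    return abs(predicted - actual) / abs(actual)

# (predicted, actual) gentrification scores for 2014 and 2015, as quoted above.
results = {
    "10025": [(1.610619, 1.654007), (1.558206, 1.798823)],
    "11211": [(1.860064, 1.836679), (1.934579, 1.982375)],
}
errors = {
    zip_code: [relative_error(p, a) for p, a in pairs]
    for zip_code, pairs in results.items()
}
```

Under this definition the errors are roughly 2.6% and 13.4% for zip code 10025 and roughly 1.3% and 2.4% for zip code 11211, consistent with the bounds of 20% and 5% stated above.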
Our gentrification score model and the LOESS fitting model were able to capture the changing trend of gentrification in different areas of New York City. While we were able to observe gentrification fairly accurately using housing and income data, certain types of 311 complaints likely allowed for a more refined and accurate model. As a quick first step to refine the model further, we could look to the full body of complaint data to see if there are relationships between income and housing changes and specific complaint types. Data on rent, crime, business, and unemployment were not readily usable and so are not factored into the score and model in their current state, but their use would allow for the analysis of 1) the mobility in and out of neighborhoods of lower income residents, who may not be able to afford higher rents or to buy homes, 2) changes in crime numbers and types that correlate with changes in income and housing prices, 3) changes in business types that correlate with changes in income and housing prices and 4) the impact of changes in income, rents, and business types on highly localized unemployment rates. These data points would help to make the model more accurate, especially around the edges of gentrifying neighborhoods. The localization of gentrification could become even finer than neighborhood or zip code if we changed address data to latitude-longitude data, allowing us to use the score to determine gentrification on a building-by-building basis. This would help in a vast array of business, civil, and cultural decisions, allowing current residents to predict gentrification in their neighborhoods and providing the opportunity to maintain the soul of the neighborhood while allowing for economic revitalization.